We first looked at the number of games played per player to motivate a cutoff. A 20-game minimum seemed the most sensible threshold for the analysis.
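As a sketch, the cutoff amounts to a simple filter; the field and player names below are hypothetical, not from the actual data set:

```python
# Hypothetical sketch of the 20-game cutoff: keep only players with
# at least 20 games played. Field names are illustrative.
players = [
    {"name": "A", "games_played": 82},
    {"name": "B", "games_played": 12},
    {"name": "C", "games_played": 20},
]

MIN_GAMES = 20
eligible = [p for p in players if p["games_played"] >= MIN_GAMES]
print([p["name"] for p in eligible])  # ['A', 'C']
```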
We ran principal component analysis (PCA) to see whether any variables stand out from the others.
The elbow plot shows that not many components are really needed. Note that we removed players on an “Entry Level Contract” and kept only players with at least 20 games played.
We then looked at which variables carried the most weight in the first two components, since this could guide our reduction methods.
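The mechanics behind the elbow plot and the loadings can be illustrated on a toy two-variable example. This is a hand-rolled Python sketch (eigendecomposition of a 2×2 covariance matrix), not the code used in the analysis, and the variable names and values are made up:

```python
import math

# Toy PCA on two correlated variables, done by hand on the 2x2
# covariance matrix; variable names and data are illustrative.
shots = [10.0, 20.0, 30.0, 40.0, 50.0]
goals = [1.0, 3.0, 2.0, 5.0, 6.0]

def center(xs):
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

def cov(xs, ys):
    n = len(xs)
    return sum(x * y for x, y in zip(center(xs), center(ys))) / (n - 1)

# Covariance matrix [[a, b], [b, c]] and its two eigenvalues.
a, b, c = cov(shots, shots), cov(shots, goals), cov(goals, goals)
disc = math.sqrt((a - c) ** 2 + 4 * b * b)
lam1 = (a + c + disc) / 2   # variance captured by PC1
lam2 = (a + c - disc) / 2   # variance captured by PC2

# The explained-variance ratio is what an elbow/scree plot displays.
explained = lam1 / (lam1 + lam2)
print(f"PC1 explains {explained:.1%} of the variance")

# The unit eigenvector for lam1 gives each variable's loading on PC1.
v = (b, lam1 - a)
norm = math.hypot(*v)
loadings = (v[0] / norm, v[1] / norm)
print("PC1 loadings:", loadings)
```

When one component captures nearly all the variance, the elbow appears immediately, which is the pattern the plot is meant to reveal.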
Next, we examine the regression model to see what the leading coefficients are.
Here we look at how the variables correlate with one another.
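Each cell of such a correlation matrix is a Pearson correlation; a minimal sketch with made-up values:

```python
import math

# Pearson correlation by hand -- the quantity in each cell of a
# correlation matrix. The data values are made up.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

goals = [1.0, 3.0, 2.0, 5.0, 6.0]
assists = [2.0, 4.0, 5.0, 7.0, 9.0]
r = pearson(goals, assists)
print(f"r = {r:.3f}")
```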
Here we compared the RMSE of several regression models to see which performed best.
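RMSE is the root mean squared error between the observed values and a model's predictions; a quick sketch with invented numbers:

```python
import math

# RMSE used to compare models: lower is better. All values are made up.
def rmse(actual, predicted):
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

actual = [3.0, 5.0, 2.0, 8.0]
model_a = [2.5, 5.5, 2.0, 7.0]   # hypothetical predictions, model A
model_b = [1.0, 6.0, 4.0, 9.0]   # hypothetical predictions, model B

print(rmse(actual, model_a))  # lower RMSE -> better fit
print(rmse(actual, model_b))
```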
Next we show the ridge plot, followed by the lasso plot.
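If the ridge plot is the usual coefficient-path plot, what it traces is shrinkage toward zero as the penalty grows. For a single centered predictor this has a closed form, beta(lambda) = sum(x*y) / (sum(x*x) + lambda), which the sketch below evaluates on made-up data:

```python
# One-predictor ridge sketch: with centered x and y, the ridge estimate
# is beta(lam) = sum(x*y) / (sum(x*x) + lam), so the coefficient
# shrinks toward zero as lambda grows -- the decay a ridge path plot
# traces. Data values are made up.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-3.9, -2.1, 0.1, 1.8, 4.1]

sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)

for lam in [0.0, 1.0, 10.0, 100.0]:
    beta = sxy / (sxx + lam)
    print(f"lambda={lam:>6}: beta={beta:.4f}")
```

Lasso paths shrink the same way but can set coefficients exactly to zero, which is why the two plots look different.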
Next, we compared all the models. Starting from ridge regression (alpha = 0), we increased alpha in steps of 0.25 up to 1 (the lasso model), and compared each fit to the linear regression model to see which was best.
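The alpha grid corresponds to the elastic-net penalty, which blends the ridge (alpha = 0) and lasso (alpha = 1) penalties; a minimal sketch of the mixing:

```python
# Elastic-net penalty in the glmnet parameterization:
#   lam * (alpha * ||beta||_1 + (1 - alpha) * ||beta||_2^2 / 2)
# alpha = 0 is pure ridge, alpha = 1 is pure lasso. Betas are made up.
def elastic_net_penalty(betas, lam, alpha):
    l1 = sum(abs(b) for b in betas)
    l2 = sum(b * b for b in betas) / 2
    return lam * (alpha * l1 + (1 - alpha) * l2)

betas = [0.5, -1.5, 2.0]
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(alpha, elastic_net_penalty(betas, lam=1.0, alpha=alpha))
```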
Here we plot the 100 most important variables from this dimension-reduction technique, the random forest.
We did the same after subsetting the data to forwards.
And lastly we applied the same random forest technique to the defensemen.
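One common way such importance scores are produced is permutation importance: shuffle a predictor and see how much the error grows. The sketch below uses a fixed linear rule standing in for the forest and simulated data, so it illustrates the idea rather than the actual model:

```python
import random

# Toy permutation importance -- the idea behind a random forest's
# variable-importance plot: shuffle one predictor at a time and measure
# how much the prediction error grows. The "model" is a fixed linear
# rule standing in for the forest; the data are simulated.
random.seed(42)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]   # strong predictor
x2 = [random.gauss(0, 1) for _ in range(n)]   # weak predictor
y = [3 * a + 0.3 * b for a, b in zip(x1, x2)]

def mse(c1, c2):
    preds = [3 * a + 0.3 * b for a, b in zip(c1, c2)]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / n

baseline = mse(x1, x2)          # zero here: the rule fits y exactly
importance = {}
for name in ("x1", "x2"):
    cols = {"x1": x1[:], "x2": x2[:]}
    random.shuffle(cols[name])
    importance[name] = mse(cols["x1"], cols["x2"]) - baseline

print(importance)  # x1 should dwarf x2
```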
Across these plots, we looked for a pattern that would identify the most important variables to use in our reduction.
We first looked at all the data, without subsetting by position.
## # A tibble: 74 × 2
## value n
## <chr> <int>
## 1 age 3
## 2 all_assists 3
## 3 all_diff_penalty_minutes 3
## 4 all_i_f_giveaways 3
## 5 all_i_f_goals 3
## 6 all_i_f_shots_on_goal 3
## 7 all_shots_blocked_by_player 3
## 8 diff_5on5_high_danger_goals 3
## 9 diff_all_high_danger_shots 3
## 10 games_played 3
## # … with 64 more rows
We then did the same for the forwards.
## # A tibble: 71 × 2
## value n
## <chr> <int>
## 1 age 3
## 2 all_assists 3
## 3 all_diff_penalty_minutes 3
## 4 all_faceoffs_lost 3
## 5 all_i_f_giveaways 3
## 6 all_i_f_hits 3
## 7 all_i_f_shots_on_goal 3
## 8 diff_4on5_corsi_percentage 3
## 9 diff_5on5_high_danger_goals 3
## 10 diff_all_corsi_percentage 3
## # … with 61 more rows
And lastly we looked at the defensive players.
## # A tibble: 74 × 2
## value n
## <chr> <int>
## 1 age 3
## 2 all_assists 3
## 3 all_i_f_giveaways 3
## 4 all_i_f_hits 3
## 5 all_i_f_shots_on_goal 3
## 6 diff_4on5_corsi_percentage 3
## 7 diff_5on5_goals 3
## 8 diff_5on5_high_danger_goals 3
## 9 diff_5on5_high_danger_shots 3
## 10 diff_all_corsi_percentage 3
## # … with 64 more rows
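The value/n tallies above can be reproduced with a simple count of how many of the per-model top-variable lists each variable appears in; a minimal sketch with made-up lists:

```python
from collections import Counter

# Sketch of the value/n tallies: count how many of the per-model
# top-variable lists each variable appears in. The lists are made up.
model_vars = [
    ["age", "all_assists", "games_played"],        # e.g. model 1
    ["age", "all_assists", "all_i_f_goals"],       # e.g. model 2
    ["age", "all_assists", "games_played"],        # e.g. model 3
]

counts = Counter(v for model in model_vars for v in model)
for value, n in sorted(counts.items()):
    print(value, n)
```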
Using the random forest, we can see what the model suggests as the optimal number of predictors to keep.
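The selection step reduces to picking the predictor count with the lowest estimated error; a hypothetical sketch with invented error values, not output from the actual model:

```python
# Hypothetical sketch of choosing the "optimal number of predictors":
# keep the count whose estimated error is lowest. The error values
# below are invented, not from the actual random forest.
cv_error = {5: 1.9, 10: 1.4, 15: 1.2, 20: 1.25, 25: 1.3}
best_n = min(cv_error, key=cv_error.get)
print(best_n)  # 15
```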